Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
  WARNING: The scripts f2py, f2py3 and f2py3.10 are installed in '/root/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 1.5.3 which is incompatible.
mizani 0.13.0 requires pandas>=2.2.0, but you have pandas 1.5.3 which is incompatible.
mlxtend 0.23.3 requires scikit-learn>=1.3.1, but you have scikit-learn 1.2.2 which is incompatible.
plotnine 0.14.3 requires matplotlib>=3.8.0, but you have matplotlib 3.7.1 which is incompatible.
plotnine 0.14.3 requires pandas>=2.2.0, but you have pandas 1.5.3 which is incompatible.
xarray 2024.10.0 requires pandas>=2.1, but you have pandas 1.5.3 which is incompatible.

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

In [2]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# to split data into training and test sets
from sklearn.model_selection import train_test_split

# to build decision tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# to tune different models
from sklearn.model_selection import GridSearchCV

# to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    recall_score,
    precision_score,
    f1_score,
)

Loading the dataset

In [3]:
# Mounting Google Colab drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [4]:
# loading the dataset
customer_data = pd.read_csv("/content/drive/MyDrive/Python/Loan_Modelling.csv")
In [5]:
# copying the data to another variable to avoid any changes to original data
df = customer_data.copy()

Data Overview

  • Observations
  • Sanity checks

Viewing the first and last 5 rows of the dataset

In [6]:
# viewing the first 5 rows of the data
df.head()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [7]:
# viewing the last 5 rows of the data
df.tail()
Out[7]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

Checking the shape of the dataset.

In [8]:
#Checking the shape of the dataset
df.shape
Out[8]:
(5000, 14)
  • The dataset has 5000 rows and 14 columns

Checking the attribute types

In [9]:
# checking datatypes and number of non-null values for each column
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
  • There are 7 numerical and 7 categorical variables in the data.
  • Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are read in as numerical but are actually binary categorical variables, already 0/1 encoded.
  • Education and Family are ordinal categorical variables, already numerically encoded.
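As an optional, hedged aside (not used later in this notebook), the binary and ordinal columns could be given pandas' explicit category dtype to make their categorical nature visible in the dtypes; the toy frame below reuses two column names from the data dictionary with made-up values.

```python
import pandas as pd

# toy frame with the same column names as the dataset (values are illustrative)
toy = pd.DataFrame({
    "Education": [1, 2, 3, 1],
    "Personal_Loan": [0, 0, 1, 0],
})

# Education is ordinal: declare an explicit category order
toy["Education"] = pd.Categorical(toy["Education"], categories=[1, 2, 3], ordered=True)

# Personal_Loan is binary: a plain category dtype is enough
toy["Personal_Loan"] = toy["Personal_Loan"].astype("category")

print(toy.dtypes)
```

Here we keep the columns numeric for modelling, as noted above, so this conversion is purely for documentation during EDA.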

Checking the statistical summary

In [10]:
#Checking statistical Summary of the data frame
df.describe(include="all").T
Out[10]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0
  • Customer ages range from 23 to 67 years, with an average of 45 years; 50% of customers are below 45.
  • The Experience column contains negative values, which need further exploration and treatment if required.
  • Average annual income is ~74K dollars; incomes range from 8K to 224K dollars.
  • Average credit card spend is ~1.9K dollars per month, with 75% of customers spending less than 2.5K dollars per month.
  • 50% of customers have no mortgage, and 75% have mortgage values below 101K dollars, which implies roughly 25% of customers have mortgages between 0 and 101K dollars and roughly 25% above 101K dollars.
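The "no mortgage" reading above follows directly from the percentiles; a minimal sketch on made-up values (on the real data the same check would use df['Mortgage']):

```python
import pandas as pd

# illustrative Mortgage values: more than half zero, matching the pattern described above
mortgage = pd.Series([0, 0, 0, 0, 0, 0, 90, 101, 150, 300])

zero_share = (mortgage == 0).mean()   # fraction of customers with no mortgage
q75 = mortgage.quantile(0.75)         # 75th percentile of mortgage values

print(f"share with no mortgage: {zero_share:.0%}, 75th percentile: {q75}")
```

Because more than half the values are zero, the median is 0 while the 75th percentile still reflects the non-zero mortgages.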

Checking for missing values in Data

In [11]:
# checking for missing values
df.isnull().sum()
Out[11]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0

  • There are no missing values in the data.

Checking the data duplication

In [12]:
# checking for duplicate values
df.duplicated().sum()
Out[12]:
0
  • There is no duplicate data.

Checking the data uniqueness

In [13]:
# checking the number of unique values in each column
df.nunique()
Out[13]:
0
ID 5000
Age 45
Experience 47
Income 162
ZIPCode 467
Family 4
CCAvg 108
Education 3
Mortgage 347
Personal_Loan 2
Securities_Account 2
CD_Account 2
Online 2
CreditCard 2

In [14]:
#check unique values of Family
unique_values = df['Family'].unique()
print(f"Unique values in 'Family': {unique_values}")
Unique values in 'Family': [4 3 1 2]
  • There are 467 unique ZIP codes.
  • The unique values of family size are 1, 2, 3, and 4.

Investigating Zipcode

In [15]:
unique_values = df['ZIPCode'].unique()
print(f"Unique values in 'ZIPCode': {unique_values}")
Unique values in 'ZIPCode': [91107 90089 94720 94112 91330 92121 91711 93943 93023 94710 90277 93106
 94920 91741 95054 95010 94305 91604 94015 90095 91320 95521 95064 90064
 94539 94104 94117 94801 94035 92647 95814 94114 94115 92672 94122 90019
 95616 94065 95014 91380 95747 92373 92093 94005 90245 95819 94022 90404
 93407 94523 90024 91360 95670 95123 90045 91335 93907 92007 94606 94611
 94901 92220 93305 95134 94612 92507 91730 94501 94303 94105 94550 92612
 95617 92374 94080 94608 93555 93311 94704 92717 92037 95136 94542 94143
 91775 92703 92354 92024 92831 92833 94304 90057 92130 91301 92096 92646
 92182 92131 93720 90840 95035 93010 94928 95831 91770 90007 94102 91423
 93955 94107 92834 93117 94551 94596 94025 94545 95053 90036 91125 95120
 94706 95827 90503 90250 95817 95503 93111 94132 95818 91942 90401 93524
 95133 92173 94043 92521 92122 93118 92697 94577 91345 94123 92152 91355
 94609 94306 96150 94110 94707 91326 90291 92807 95051 94085 92677 92614
 92626 94583 92103 92691 92407 90504 94002 95039 94063 94923 95023 90058
 92126 94118 90029 92806 94806 92110 94536 90623 92069 92843 92120 95605
 90740 91207 95929 93437 90630 90034 90266 95630 93657 92038 91304 92606
 92192 90745 95060 94301 92692 92101 94610 90254 94590 92028 92054 92029
 93105 91941 92346 94402 94618 94904 93077 95482 91709 91311 94509 92866
 91745 94111 94309 90073 92333 90505 94998 94086 94709 95825 90509 93108
 94588 91706 92109 92068 95841 92123 91342 90232 92634 91006 91768 90028
 92008 95112 92154 92115 92177 90640 94607 92780 90009 92518 91007 93014
 94024 90027 95207 90717 94534 94010 91614 94234 90210 95020 92870 92124
 90049 94521 95678 95045 92653 92821 90025 92835 91910 94701 91129 90071
 96651 94960 91902 90033 95621 90037 90005 93940 91109 93009 93561 95126
 94109 93107 94591 92251 92648 92709 91754 92009 96064 91103 91030 90066
 95403 91016 95348 91950 95822 94538 92056 93063 91040 92661 94061 95758
 96091 94066 94939 95138 95762 92064 94708 92106 92116 91302 90048 90405
 92325 91116 92868 90638 90747 93611 95833 91605 92675 90650 95820 90018
 93711 95973 92886 95812 91203 91105 95008 90016 90035 92129 90720 94949
 90041 95003 95192 91101 94126 90230 93101 91365 91367 91763 92660 92104
 91361 90011 90032 95354 94546 92673 95741 95351 92399 90274 94087 90044
 94131 94124 95032 90212 93109 94019 95828 90086 94555 93033 93022 91343
 91911 94803 94553 95211 90304 92084 90601 92704 92350 94705 93401 90502
 94571 95070 92735 95037 95135 94028 96003 91024 90065 95405 95370 93727
 92867 95821 94566 95125 94526 94604 96008 93065 96001 95006 90639 92630
 95307 91801 94302 91710 93950 90059 94108 94558 93933 92161 94507 94575
 95449 93403 93460 95005 93302 94040 91401 95816 92624 95131 94965 91784
 91765 90280 95422 95518 95193 92694 90275 90272 91791 92705 91773 93003
 90755 96145 94703 96094 95842 94116 90068 94970 90813 94404 94598]
  • The ZIP codes range from 90XXX to 96XXX, which appear to be Californian. With 467 unique values, treating each ZIP code as its own category would fragment the data.
  • A possible approach is to bucket ZIP codes by their first two digits, giving 7 categories (90 to 96) for analysis. However, from domain knowledge, US loan decisions are not tied to an applicant's ZIP code (a bank cannot approve or reject a loan based on it). So for simplicity we will not categorize ZIP codes for EDA and will treat the column as just another identifier-like attribute, similar to ID.
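For completeness, the two-digit bucketing idea could be sketched as below (not applied in this notebook, since ZIPCode is ultimately set aside); the ZIP values are samples from the output above.

```python
import pandas as pd

# illustrative ZIP codes in the 90xxx-96xxx range seen above
zips = pd.Series([91107, 90089, 94720, 96651, 92121])

# integer division by 1000 keeps the first two digits, giving a coarse region bucket (90-96)
region = zips // 1000
print(region.tolist())
```

This would reduce 467 levels to 7 region-like categories, at the cost of mixing very different localities within each bucket.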

Investigating Negative Values in Experience Column

In [16]:
# Find the unique values of Experience to check for negative values
experience_values = df['Experience'].unique()
print(experience_values)
[ 1 19 15  9  8 13 27 24 10 39  5 23 32 41 30 14 18 21 28 31 11 16 20 35
  6 25  7 12 26 37 17  2 36 29  3 22 -1 34  0 38 40 33  4 -2 42 -3 43]
  • Experience has three negative values: -1, -2, and -3.
In [ ]:
# Check how many rows in Experience column has negative values
negative_values = df[df['Experience'] <0].shape[0]
print(negative_values)
52
  • 52 rows have negative Experience values (-1, -2, -3), which make no practical sense; this looks like a data entry error. Since it affects only about 1% of the rows, a logical treatment is to convert these values to their absolute values rather than dropping the rows entirely or imputing them with NaN or the median.

NaN imputation is avoided to prevent unforeseeable issues during model building. Imputation by the mean is avoided because the mean experience is approximately 20 years, which would significantly distort these rows.
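A minimal sketch of the absolute-value treatment on toy values (the full column is treated this way later, in the preprocessing section):

```python
import pandas as pd

# toy Experience values including the erroneous negatives seen in the data
exp = pd.Series([1, 19, -1, -2, -3, 43])

# treating the negatives as sign-entry errors: take absolute values
exp_fixed = exp.abs()
print(exp_fixed.tolist())
```

This keeps all 52 affected rows and assumes the magnitude was entered correctly and only the sign was wrong.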

EXPLORATORY DATA ANALYSIS

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
  • Before doing EDA, we will establish some facts:
    • ID has no implications for the analysis, so we will drop this field.
    • Even though the dataset is numerical, the following 7 variables will be treated as categorical during EDA: Personal_Loan, Securities_Account, CD_Account, Online, CreditCard, Family, and Education.
    • We will also define plotting functions: a combined histogram and boxplot, labeled barplots, stacked barplots, and a distribution plot with respect to the target.

Drop Customer ID

In [17]:
#Customer ID does not have any implication in EDA, hence this column can be dropped
df.drop('ID', axis=1, inplace=True)

Define Plotting Functions

In [18]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [19]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [20]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [21]:
### function to plot distributions wrt target


def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate Analysis

Observations on Income

In [22]:
#Plot Income (Histogram Boxplot)
histogram_boxplot(df, "Income")

Observations

  • The Income distribution is slightly right skewed, with a mean income around 74K dollars.
  • There are outliers present.

Observations on CCAvg

In [23]:
#Plot Monthly CC Average Spend (Histogram Boxplot)
histogram_boxplot(df, "CCAvg")

Observations

  • The CCAvg spend distribution is right skewed, with a mean spend of ~1.9K dollars per month.
  • There are outliers present.

Observations on Mortgage

In [24]:
#Plot Mortgage(Histogram Boxplot)
histogram_boxplot(df, "Mortgage")
  • Mortgage is a right-skewed distribution. Note that even though the mean is around 56K dollars, 75% of customers have mortgages below 101K dollars.
  • There are outliers present.

Observations on Age

In [25]:
#Plot Age
histogram_boxplot(df, "Age")
  • Age is approximately normally distributed, with a mean of around 45 years.
  • There are no outliers present.

Observations on Experience

In [26]:
#Plot Experience
histogram_boxplot(df, "Experience")
  • Experience is approximately normally distributed, with a mean of about 20 years.
  • There are no outliers present.

Observations on ZipCode

In [27]:
#Plot ZIPCode
histogram_boxplot(df, "ZIPCode")
  • ZIP codes lie between 90XXX and 96XXX.
  • There are no outliers.

Observations on Family Size

In [28]:
# Observations on Family
labeled_barplot(df, "Family", perc=True)
  • The customer distribution across family sizes is fairly even; no distinguishing factor is noticeable here.

Observations on Education

In [29]:
labeled_barplot(df, "Education", perc=True)
  • ~42% of customers are at the undergraduate education level; the remaining 58% are graduate or advanced/professional.

Observations on Personal Loan

In [30]:
labeled_barplot(df, "Personal_Loan", perc=True)
  • 9.6% of customers purchased a personal loan.

Bivariate Analysis

In the US, banks do not tie loan eligibility criteria to ZIP codes. Hence this field can be dropped from the bivariate analysis, as it has no implications.

In [ ]:
# dropping ZIPCode; errors="ignore" prevents a KeyError if the column was already dropped in an earlier run
df.drop('ZIPCode', axis=1, inplace=True, errors='ignore')

HeatMap

In [34]:
# defining the size of the plot
plt.figure(figsize=(30, 20))

# defining the list of numerical features to plot
num_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

# plotting the heatmap for correlation
sns.heatmap(
    df[num_features].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
  • Age and Experience are strongly positively correlated.
  • There is a positive correlation (0.65) between Income and CCAvg monthly spend.
  • Mortgage is positively correlated with Income (0.21) and with CCAvg spend (0.11), though not as strongly as Income is with CCAvg.
  • CD_Account is positively correlated with CreditCard, Online, CCAvg spend, and Income, but not very strongly.

Pair Plot

In [35]:
# defining the list of numerical features to plot
num_features = ['Age', 'Experience', 'Income', 'CCAvg', 'Mortgage']

# pairplot creates its own figure, so no separate plt.figure call is needed
sns.pairplot(data=df, vars=num_features, diag_kind="kde", corner=True, hue='Personal_Loan')
plt.show()
  • Customers with incomes close to 100K and above have purchased personal loans (PL).
  • Customers with education levels 2 or 3, income above 100K, higher CCAvg spend, and higher mortgages (350K plus) tend to be the previous PL purchasers.
  • Customers with higher CCAvg spend, higher income, and family sizes of 3 or 4 are more likely to purchase a PL.
  • Customers with higher mortgages have had more PL purchases.

Target Variable with categories

Let's see how the target variable varies across the type of Family

In [ ]:
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
  • Around 55% of the loan purchases are from customers with family sizes of 3 or 4.
In [ ]:
stacked_barplot(df, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
  • Around 80% of the loan purchases are from customers at the graduate and advanced/professional education levels.
In [ ]:
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------

  • Around 70% of loan purchases are from non credit card users, showing no obvious relationship between credit card use and previous loan purchase.

In [ ]:
stacked_barplot(df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
  • ~60% of customers use online banking, of which ~10% purchased a loan previously, essentially the same as the overall 9.6% rate, so Online alone does not look like a strong differentiator.
In [ ]:
stacked_barplot(df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
  • ~10% of customers have a securities account, of which ~11.5% purchased a loan, close to the overall rate, so this is a less important factor.
In [ ]:
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
  • ~6% of customers have a CD account, but nearly half of them (140 of 302, ~46%) purchased a loan, far above the overall 9.6% rate, making CD_Account a notably important factor.

Target Variablle Distributions

Let's analyze the relation between Income and Personal Loan.

In [ ]:
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
  • Most personal loans are purchased in the 90-100K+ income range. Outliers indicate some loans at lower incomes.
In [ ]:
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
  • Customers with CCAvg spend of ~2.3K and above account for most personal loan purchases. Outliers indicate purchases at lower CCAvg spend, suggesting other factors at play.
In [ ]:
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
  • Among customers who took a personal loan, roughly 75% have mortgages under 100K.
In [ ]:
distribution_plot_wrt_target(df, "Age", "Personal_Loan")

Observations

  • Previous PLs have been taken by customers across ages 27 to 65.

Overall EDA Insights

  • Income is the most significant variable driving personal loan purchase. Income is strongly positively correlated with CCAvg, so higher income indicates higher CCAvg spend and a higher probability of loan purchase. Customers with bigger family sizes and higher CCAvg spend also show higher PL uptake.
  • Customers with higher education levels (2, 3) buy more PLs.
  • Age does not have a clear, significant relationship with PL purchase.
  • 75% of mortgage values are below 101K; PL uptake is higher among lower-mortgage customers.

Data Preprocessing

  • Missing value treatment
    • There are no missing values in the data, no missing value treatment required.
  • Outlier detection and treatment (if needed)
    • There are outliers in three variables: Income, Mortgage, and CCAvg. However, we will not treat them, as they are legitimate values in the data.
  • Feature engineering (if needed)
    • Family and Education are ordinal categorical variables but are already in numerical format, so they do not need to be encoded again.
    • Securities_Account, CD_Account, Online, and CreditCard are binary categorical variables already in numerical (0/1) form, so they do not need encoding either.
    • These 6 will be treated as categorical variables during modelling.
  • Any other preprocessing steps (if needed)
    • The ID and ZIPCode columns have no implication in data modelling, so they need to be dropped (ID was already dropped during EDA)
    • The Negative Values of Experience Column need to be treated
In [37]:
# Dropping ZIP Code, ID column was dropped at EDA
df.drop(['ZIPCode'], axis=1, inplace=True)
In [41]:
# Treating negative values of the Experience column by taking absolute values
df['Experience'] = df['Experience'].abs()
  • Preparing data for modeling
    • Since we are building a decision tree, scaling is not strictly required: trees split on feature thresholds rather than on distances or gradients. Hence we will not scale the data.
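The point above can be demonstrated with a small sketch on synthetic data (not the notebook's df): tree splits compare feature values against thresholds, so rescaling a feature simply rescales the learned thresholds and leaves the predictions unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: decision tree splits are invariant to feature scaling,
# so standardisation adds nothing for a tree model.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

tree_raw = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(X_demo * 1000.0, y_demo)

# Predictions agree even though one copy of the data is rescaled 1000x
print((tree_raw.predict(X_demo) == tree_scaled.predict(X_demo * 1000.0)).all())
```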
In [42]:
# defining the explanatory (independent) and response (dependent) variables
X = df.drop(["Personal_Loan"], axis=1)
y = df["Personal_Loan"]
In [43]:
# specifying the datatype of the independent variables data frame
X = X.astype(float)
X.head()
Out[43]:
Age Experience Income Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
0 25.0 1.0 49.0 4.0 1.6 1.0 0.0 1.0 0.0 0.0 0.0
1 45.0 19.0 34.0 3.0 1.5 1.0 0.0 1.0 0.0 0.0 0.0
2 39.0 15.0 11.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
3 35.0 9.0 100.0 1.0 2.7 2.0 0.0 0.0 0.0 0.0 0.0
4 35.0 8.0 45.0 4.0 1.0 2.0 0.0 0.0 0.0 0.0 1.0

Creating training and test sets.

In [44]:
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
In [45]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 11)
Shape of test set :  (1500, 11)
Percentage of classes in training set:
Personal_Loan
0    0.904
1    0.096
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    0.904
1    0.096
Name: proportion, dtype: float64

We had seen that around 90.4% of observations belong to class 0 (No Personal Loan) and 9.6% belong to class 1 (Took Personal Loan); this proportion is preserved in both the train and test sets.

Model Building

Model Evaluation Criterion

Model can make wrong predictions as:

  • Predicting a customer will not purchase the loan when in reality the customer would take it (False Negative, FN)
  • Predicting a customer will take the loan when in reality the customer will not (False Positive, FP)

Which case is more important?

  • If we predict that a customer will not take the loan, the bank does not target that customer with the promotion campaign and loses the opportunity to convert a customer who would have purchased the loan, forgoing the interest revenue from that loan (FN).

  • If we predict that a customer will purchase a loan but in reality the customer does not, the bank bears the cost of the targeted campaign (FP). The campaign cost is budgeted for and is a minor loss compared to missing the opportunity to target a potential customer and convert them into revenue.

How to increase revenue ?

The bank would want recall to be maximized: the greater the recall score, the lower the number of false negatives (missed potential loan customers).
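As a toy illustration (hypothetical labels, not the notebook's data), recall = TP / (TP + FN), so every false negative pulls recall down:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: 4 actual buyers, the model catches 3 of them (1 FN) and
# wrongly flags 1 non-buyer (1 FP).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn)                        # 3 1
print(tp / (tp + fn))                # 0.75
print(recall_score(y_true, y_pred))  # 0.75
```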

Creating functions to calculate different metrics and the confusion matrix

  • The model_performance_classification_sklearn function will be used to check the performance of the models.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [46]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [47]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Decision Tree (default)

In [48]:
model0 = DecisionTreeClassifier(criterion="gini",random_state=42)
model0.fit(X_train, y_train)
Out[48]:
DecisionTreeClassifier(random_state=42)

Checking model performance on training set

In [49]:
confusion_matrix_sklearn(model0, X_train, y_train)
In [50]:
decision_tree_default_perf_train = model_performance_classification_sklearn(
    model0, X_train, y_train
)
decision_tree_default_perf_train
Out[50]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

Checking model performance on test set

In [51]:
confusion_matrix_sklearn(model0, X_test, y_test)
In [52]:
decision_tree_default_perf_test = model_performance_classification_sklearn(
    model0, X_test, y_test
)
decision_tree_default_perf_test
Out[52]:
Accuracy Recall Precision F1
0 0.980667 0.909722 0.891156 0.900344
  • Observations
    • The perfect score on the training set indicates overfitting, but the model performs very well on the test data, indicating that it still generalises well to new data (with a small amount of overfitting).

Visualizing the Decision Tree

In [53]:
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [54]:
plt.figure(figsize=(20, 30))

out = tree.plot_tree(
    model0,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
  • Observations
    • We can observe that this is a very complex model
In [55]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model0, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2483.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- Age <= 27.00
|   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Age >  27.00
|   |   |   |   |--- Income <= 92.50
|   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |--- Mortgage <= 216.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Experience <= 18.50
|   |   |   |   |   |   |   |   |   |--- Age <= 43.00
|   |   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Age >  43.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Experience >  18.50
|   |   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |--- Mortgage <= 94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Mortgage >  94.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  216.50
|   |   |   |   |   |   |   |--- Income <= 68.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Income >  68.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |--- Mortgage <= 89.00
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  89.00
|   |   |   |   |   |   |   |--- Mortgage <= 99.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  99.50
|   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |--- Income >  92.50
|   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |--- Income <= 96.50
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Income >  96.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- CD_Account >  0.50
|   |   |   |--- CCAvg <= 4.25
|   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.25
|   |   |   |   |--- Mortgage <= 38.00
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Mortgage >  38.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|--- Income >  98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 99.50
|   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- Family >  1.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- Income >  99.50
|   |   |   |   |--- Income <= 104.50
|   |   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |   |--- weights: [17.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |   |--- CCAvg <= 4.25
|   |   |   |   |   |   |   |--- Mortgage <= 124.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  124.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  4.25
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Income >  104.50
|   |   |   |   |   |--- weights: [449.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 113.50
|   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |--- CCAvg <= 2.05
|   |   |   |   |   |   |--- Experience <= 15.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Experience >  15.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.05
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Online >  0.50
|   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Income >  113.50
|   |   |   |   |--- weights: [0.00, 49.00] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [28.00, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Experience <= 8.00
|   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |--- Experience >  8.00
|   |   |   |   |   |   |--- Experience <= 31.50
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.05
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  1.05
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |--- Experience >  31.50
|   |   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- CCAvg <= 4.65
|   |   |   |   |   |--- CCAvg <= 4.45
|   |   |   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |   |   |--- Family <= 1.50
|   |   |   |   |   |   |   |   |--- Age <= 45.00
|   |   |   |   |   |   |   |   |   |--- Experience <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Experience >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Family >  1.50
|   |   |   |   |   |   |   |   |--- Experience <= 20.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |   |   |   |   |--- Experience >  20.50
|   |   |   |   |   |   |   |   |   |--- Age <= 52.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  52.00
|   |   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  63.50
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  4.45
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  4.65
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- CCAvg <= 0.65
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  0.65
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 215.00] class: 1

Using the above extracted decision rules we can make interpretations from the decision tree model, for example:

--- Income <= 98.50 --- CCAvg > 2.95 --- CD_Account <= 0.50 --- Age <= 27.00

If Income is less than or equal to 98.5k, monthly CCAvg spend is greater than 2.95, the customer has no CD account (CD_Account <= 0.5), and age is less than or equal to 27, the customer is most likely to purchase a personal loan.

Note: Interpretations from other decision rules can be made similarly.
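Such a rule can also be checked programmatically as a boolean filter. A sketch on a few toy rows (hypothetical values; the real df has the same column names):

```python
import pandas as pd

# Toy rows standing in for the real data; the rule below mirrors the
# extracted leaf: Income <= 98.5, CCAvg > 2.95, CD_Account <= 0.5, Age <= 27.
toy = pd.DataFrame({
    "Income": [80, 120, 95],
    "CCAvg": [3.5, 1.0, 3.0],
    "CD_Account": [0, 0, 0],
    "Age": [25, 40, 26],
})
rule = (
    (toy["Income"] <= 98.5)
    & (toy["CCAvg"] > 2.95)
    & (toy["CD_Account"] <= 0.5)
    & (toy["Age"] <= 27)
)
print(toy[rule].index.tolist())  # indices of rows that fall into this leaf
```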

In [56]:
importances = model0.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Observations
    • Income, Education, and Family are the top 3 important features, followed by CCAvg spend.

Decision Tree (with class_weights)

  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree may become biased toward it

  • In this case, we will set class_weight = "balanced", which will automatically adjust the weights to be inversely proportional to the class frequencies in the input data

  • class_weight is a hyperparameter for the decision tree classifier
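As a sketch of what "balanced" does (using the 90.4% / 9.6% split observed in the training data), each class weight is n_samples / (n_classes * class_count):

```python
import numpy as np

# Mimic the 90.4% / 9.6% class split seen in y_train
y_demo = np.array([0] * 904 + [1] * 96)
classes, counts = np.unique(y_demo, return_counts=True)

# sklearn's "balanced" formula: n_samples / (n_classes * count per class)
weights = len(y_demo) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
# {0: 0.553, 1: 5.208} -- the minority class is upweighted roughly 9.4x
```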

In [57]:
model1 = DecisionTreeClassifier(criterion="gini",random_state=42, class_weight="balanced")
model1.fit(X_train, y_train)
Out[57]:
DecisionTreeClassifier(class_weight='balanced', random_state=42)

Checking performance on training set

In [58]:
confusion_matrix_sklearn(model1, X_train, y_train)
In [59]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model1, X_train, y_train
)
decision_tree_perf_train
Out[59]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
  • Model is able to perfectly classify all the data points on the training set.
  • 0 errors on the training set, each sample has been classified correctly.
  • As we know, a decision tree will continue to grow until it classifies every data point correctly if no restrictions are applied, since the tree learns all the patterns in the training set.
  • This generally leads to overfitting: the decision tree performs well on the training set but fails to replicate that performance on the test set.

Checking model performance on test set

In [60]:
confusion_matrix_sklearn(model1, X_test, y_test)
In [61]:
decision_tree_perf_test = model_performance_classification_sklearn(
    model1, X_test, y_test
)
decision_tree_perf_test
Out[61]:
Accuracy Recall Precision F1
0 0.978 0.895833 0.877551 0.886598
  • Observations
    • The model is performing well, generalising well to new data, with only a small amount of overfitting

Visualizing the Decision Tree

In [62]:
column_names = list(X.columns)
feature_names = column_names
print(feature_names)
['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [63]:
plt.figure(figsize=(20, 30))

out = tree.plot_tree(
    model1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
  • Observations
    • This is a very complex decision tree
In [64]:
# Text report showing the rules of the decision tree -

print(tree.export_text(model1, feature_names=feature_names, show_weights=True))

In [65]:
importances = model1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Observations
    • Income is the most important variable affecting personal loan purchase. Education is a close second, followed by Family and CCAvg; these are the 4 key factors in a customer's decision to take a personal loan.

Model Performance Improvement

Decision Tree (Pre-pruning)

  • Hyperparameter tuning is crucial because it directly affects the performance of a model.
  • Unlike model parameters which are learned during training, hyperparameters need to be set before training.
  • Effective hyperparameter tuning helps in improving the performance and robustness of the model.
  • The below custom loop for hyperparameter tuning iterates over predefined parameter values to identify the best model based on the metric of choice (recall score).
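The notebook's loop selects on the test-set recall and the train/test recall gap; as an alternative sketch (with synthetic data standing in for X_train and y_train), the same grid could be searched with sklearn's GridSearchCV using cross-validated recall:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real training set
X_demo, y_demo = make_classification(
    n_samples=400, weights=[0.9], flip_y=0.02, random_state=42
)

# Same grid of pre-pruning hyperparameters as the custom loop
param_grid = {
    "max_depth": [2, 4, 6, 8, 10],
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",  # select on recall, per the evaluation criterion
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

Unlike the custom loop, this selects hyperparameters on cross-validated recall within the training data rather than on test-set scores, which avoids tuning against the held-out set.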
In [66]:
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                criterion='gini',
                random_state=42
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Update the best estimator if the current one has a smaller
            # train/test score difference and a higher test recall
            if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
In [67]:
# creating an instance of the best model
model2 = best_estimator

# fitting the best model to the training data
model2.fit(X_train, y_train)
Out[67]:
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=50,
                       min_samples_split=10, random_state=42)

Checking performance on training set

In [68]:
confusion_matrix_sklearn(model2, X_train, y_train)
In [69]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    model2, X_train, y_train
)
decision_tree_tune_perf_train
Out[69]:
Accuracy Recall Precision F1
0 0.788 1.0 0.311688 0.475248

  • Observations
    • Recall on the training set is perfect (1.0), but precision is low (≈0.31), so the model flags many non-buyers as buyers.

Checking model performance on test set

In [70]:
confusion_matrix_sklearn(model2, X_test, y_test)
  • Recall is 1.0 on both the training and test sets, so the tuned model generalizes well to unseen data.
In [71]:
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    model2, X_test, y_test
)
decision_tree_tune_perf_test
Out[71]:
Accuracy Recall Precision F1
0 0.784667 1.0 0.308351 0.471358
  • Observations
    • The model has perfect recall on both the training and test sets, i.e., it captures all positives with no false negatives. However, the low precision indicates a high number of false positives.
    • An accuracy of ~78% also leaves room for improvement.
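The pattern above (perfect recall, low precision) follows directly from the confusion-matrix definitions. A small self-contained check with hypothetical counts, chosen only to mimic the shape of the results above:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts: every actual positive is caught (fn = 0),
# but many negatives are flagged as positive (large fp)
p, r = precision_recall(tp=144, fp=318, fn=0)
print(round(p, 3), r)  # 0.312 1.0
```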

Visualizing the Decision Tree

In [72]:
feature_names = list(X_train.columns)
importances = model2.feature_importances_
indices = np.argsort(importances)
In [73]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    model2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# the code below adds arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
  • Observation
    • This is a very simple decision tree.
In [74]:
# Text report showing the rules of a decision tree -
print(tree.export_text(model2, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1339.60, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [63.05, 93.75] class: 1
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- weights: [286.50, 317.71] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [60.84, 1338.54] class: 1

  • Observations

    • If income is at most 92.5K and the average monthly credit card spend (CCAvg) is above 2.95K, the customer is likely to purchase a loan.

    • If income is above 92.5K, the tree predicts a purchase for both education branches, with a much stronger signal when the education level is above 1 (Graduate/Advanced).
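The printed rules can be read as a two-feature screening function. A sketch translated directly from the printout above (the function name is hypothetical; Education drops out because both of its branches predict class 1):

```python
def likely_to_buy(income, ccavg):
    """Hypothetical translation of the pre-pruned tree's rules
    (income and ccavg in thousand dollars)."""
    if income <= 92.5:
        return ccavg > 2.95
    return True  # both Education branches predict class 1 (loan)

print(likely_to_buy(income=80, ccavg=3.5))   # True
print(likely_to_buy(income=80, ccavg=1.0))   # False
print(likely_to_buy(income=120, ccavg=0.5))  # True
```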

In [75]:
importances = model2.feature_importances_
importances
Out[75]:
array([0.        , 0.        , 0.79559208, 0.        , 0.07984355,
       0.12456437, 0.        , 0.        , 0.        , 0.        ,
       0.        ])
In [76]:
# importance of features in the tree building

importances = model2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • In the pre-pruned decision tree, Income is the most important feature.
  • Education and CCAvg are also decision-driving factors.

Decision Tree (Post-pruning)

  • Cost complexity pruning provides another option to control the size of a tree.
  • In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha.
  • Greater values of ccp_alpha increase the number of nodes pruned.
  • Here we only show the effect of ccp_alpha on regularizing the trees and how to choose the optimal ccp_alpha value.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.
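The mechanics above can be checked on a toy dataset: the effective alphas returned by the path come back sorted ascending, and total leaf impurity never decreases as pruning proceeds. A minimal sketch on synthetic data (assumes scikit-learn is available):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data just to exercise the pruning-path API
X_toy, y_toy = make_classification(n_samples=300, random_state=0)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_toy, y_toy)

# Alphas are sorted ascending; total leaf impurity is non-decreasing
# as more of the tree is pruned away
alphas, imps = path.ccp_alphas, path.impurities
assert all(a <= b for a, b in zip(alphas, alphas[1:]))
assert all(i <= j + 1e-12 for i, j in zip(imps, imps[1:]))
print(f"{len(alphas)} pruning steps, final alpha = {alphas[-1]:.4f}")
```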

In [77]:
clf = DecisionTreeClassifier(criterion='gini',random_state=42, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities  # abs() guards against tiny negative alphas from floating-point error
In [78]:
pd.DataFrame(path)
Out[78]:
ccp_alphas impurities
0 0.000000e+00 -1.049372e-16
1 1.052677e-19 -1.048320e-16
2 1.543926e-18 -1.032881e-16
3 1.543926e-18 -1.017441e-16
4 1.982541e-18 -9.976159e-17
5 3.298387e-18 -9.646320e-17
6 6.158159e-18 -9.030504e-17
7 9.263555e-18 -8.104149e-17
8 1.652118e-17 -6.452031e-17
9 1.945347e-16 1.300143e-16
10 3.072237e-16 4.372380e-16
11 8.073679e-16 1.244606e-15
12 1.526252e-04 3.052503e-04
13 1.539409e-04 6.131321e-04
14 1.567474e-04 9.266268e-04
15 1.621448e-04 1.250916e-03
16 2.105024e-04 1.882424e-03
17 2.857143e-04 2.168138e-03
18 2.923583e-04 2.460496e-03
19 2.927400e-04 3.338716e-03
20 2.927400e-04 4.216936e-03
21 3.001200e-04 5.117297e-03
22 3.001200e-04 5.417417e-03
23 3.024084e-04 7.534275e-03
24 3.052503e-04 7.839526e-03
25 3.078818e-04 8.147407e-03
26 3.078818e-04 8.455289e-03
27 4.722795e-04 9.399848e-03
28 4.964594e-04 1.188214e-02
29 5.168691e-04 1.239901e-02
30 5.426170e-04 1.294163e-02
31 5.703384e-04 1.351197e-02
32 7.434819e-04 1.648590e-02
33 8.479100e-04 1.818172e-02
34 1.030488e-03 2.024269e-02
35 1.099864e-03 2.134256e-02
36 1.132588e-03 2.360773e-02
37 1.603499e-03 2.521123e-02
38 1.724472e-03 2.693570e-02
39 1.784924e-03 2.872063e-02
40 1.793666e-03 3.410163e-02
41 2.154156e-03 3.840994e-02
42 2.186703e-03 4.059664e-02
43 2.962264e-03 4.355891e-02
44 5.593362e-03 4.915227e-02
45 5.872977e-03 5.502525e-02
46 6.618052e-03 6.164330e-02
47 6.913702e-03 6.855700e-02
48 7.489143e-03 7.604614e-02
49 2.867322e-02 1.047194e-01
50 5.478478e-02 2.142889e-01
51 2.857111e-01 5.000000e-01
In [79]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

In [80]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=42, ccp_alpha=ccp_alpha, class_weight="balanced",criterion='gini'
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2857110844037626

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.

In [81]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Recall vs alpha for training and testing sets

In [82]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [83]:
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [ ]:
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [ ]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [84]:
# selecting the tree with the highest test recall (np.argmax returns the first index on ties)
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0021541564995563077, class_weight='balanced',
                       random_state=42)

Checking model performance on training set

In [85]:
model4 = best_model
confusion_matrix_sklearn(model4, X_train, y_train)
In [86]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    model4, X_train, y_train
)
decision_tree_post_perf_train
Out[86]:
Accuracy Recall Precision F1
0 0.950571 1.0 0.660118 0.795266

Checking model performance on test set

In [87]:
confusion_matrix_sklearn(model4, X_test, y_test)
In [88]:
decision_tree_post_test = model_performance_classification_sklearn(
    model4, X_test, y_test
)
decision_tree_post_test
Out[88]:
Accuracy Recall Precision F1
0 0.943333 1.0 0.628821 0.772118
  • Observations
    • Perfect recall, moderate precision, high accuracy, and a more balanced F1 score.
    • The model generalizes well, with consistent performance on the training and test data.

Visualizing the Decision Tree

In [89]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    model4,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

  • Observations

    • A simple decision tree of low complexity.
In [90]:
# Text report showing the rules of a decision tree -

print(tree.export_text(model4, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1339.60, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CCAvg <= 4.35
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [49.78, 67.71] class: 1
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 26.04] class: 1
|   |   |--- CCAvg >  4.35
|   |   |   |--- weights: [13.27, 0.00] class: 0
|--- Income >  92.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 104.50
|   |   |   |   |--- CCAvg <= 3.31
|   |   |   |   |   |--- weights: [23.23, 0.00] class: 0
|   |   |   |   |--- CCAvg >  3.31
|   |   |   |   |   |--- weights: [5.53, 31.25] class: 1
|   |   |   |--- Income >  104.50
|   |   |   |   |--- weights: [248.34, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [9.40, 286.46] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 114.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [29.87, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- weights: [17.70, 41.67] class: 1
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- weights: [12.17, 140.62] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [1.11, 1156.25] class: 1

In [93]:
importances = model4.feature_importances_
indices = np.argsort(importances)
In [94]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Income is the most important feature in the post-pruned tree.
  • Family size is the second driving factor, followed by Education and CCAvg spend, in determining whether a customer takes a personal loan.

Model Performance Comparison and Final Model Selection

In [95]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_default_perf_train.T,
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[95]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.788000 0.950571
Recall 1.0 1.0 1.000000 1.000000
Precision 1.0 1.0 0.311688 0.660118
F1 1.0 1.0 0.475248 0.795266
In [96]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_default_perf_test.T,
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[96]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.980667 0.978000 0.784667 0.943333
Recall 0.909722 0.895833 1.000000 1.000000
Precision 0.891156 0.877551 0.308351 0.628821
F1 0.900344 0.886598 0.471358 0.772118
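The selection logic applied in the conclusions (among the perfect-recall models, prefer the better F1) can also be run programmatically on this table. A sketch with the test-set values copied from above (column names shortened):

```python
import pandas as pd

# Test-set metrics copied from the comparison table above
df = pd.DataFrame(
    {
        "Default": [0.980667, 0.909722, 0.891156, 0.900344],
        "ClassWeight": [0.978000, 0.895833, 0.877551, 0.886598],
        "PrePruned": [0.784667, 1.000000, 0.308351, 0.471358],
        "PostPruned": [0.943333, 1.000000, 0.628821, 0.772118],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# Keep only models with perfect recall, then pick the best F1 among them
perfect = df.loc[:, df.loc["Recall"] == 1.0]
print(perfect.loc["F1"].idxmax())  # PostPruned
```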

Conclusions

  • The aim was to build a model with the highest recall, since in this business case missing a positive instance (a customer who would buy a loan) is costly.

  • Both the pre-pruned and post-pruned decision trees fulfil this criterion. Comparing the other metrics (precision, accuracy, F1 score), the post-pruned tree should be selected because:

    • Precision is higher (66% on train and 63% on test), indicating a moderate number of false positives, whereas the pre-pruned tree's precision is poor (~31%).
    • Accuracy is high (94%), indicating a high share of correct predictions, compared with the pre-pruned tree's 78%.
    • The F1 score is more balanced (0.79 on train, 0.77 on test) versus a much poorer level for the pre-pruned tree (~0.47).
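As a sanity check, the reported F1 values follow from precision and recall via the harmonic mean:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Post-pruned tree: recall 1.0, precision 0.660118 (train) / 0.628821 (test)
print(round(f1(0.660118, 1.0), 6))  # 0.795266
print(round(f1(0.628821, 1.0), 6))  # 0.772118
```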

FINAL MODEL SELECTION

  • The post-pruned decision tree is the final model selected.

Actionable Insights and Business Recommendations

  • With perfect recall, the selected model captures 100% of the customers likely to purchase a loan.

  • Based on the insights derived from the models above, the bank's marketing department should use the following segments to maximize conversion of liability customers:

    • For income <= 92.5K, target customers with a higher average monthly credit card spend (CCAvg between 2.95K and 4.35K).

    • For income > 92.5K, factor in Education and Family size along with CCAvg:
      • For the lower education level (1: Undergrad), target smaller families (1-2 members) with income up to 104.5K and CCAvg above 3.31K, and all bigger families (3+ members).
      • For higher education levels (2-3), target customers with CCAvg above 2.45K, or with CCAvg at or below 2.45K and income above 106.5K.
      • For higher education levels with income above 114.5K, target all customers.
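These targeting rules can be written down as a simple screening function, translated directly from the post-pruned tree's printed rules (the function name and signature are hypothetical):

```python
def target_for_campaign(income, ccavg, education, family):
    """Hypothetical screen translated from the post-pruned tree's rules.

    income and ccavg in thousand dollars; education 1-3; family = household size.
    """
    if income <= 92.5:
        return 2.95 < ccavg <= 4.35
    if education <= 1.5:  # Undergrad
        if family > 2.5:
            return True
        return income <= 104.5 and ccavg > 3.31
    # Graduate / Advanced
    if income > 114.5 or ccavg > 2.45:
        return True
    return income > 106.5  # with ccavg <= 2.45

print(target_for_campaign(income=80, ccavg=3.5, education=1, family=2))   # True
print(target_for_campaign(income=150, ccavg=1.0, education=3, family=1))  # True
```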